Abstract


A new system to recognize protein-coding genes in coronavirus genomes, especially suited to the SARS-CoV genomes, is
proposed in this paper. Compared with some existing systems, the new program package has the merits of simplicity, high
accuracy, reliability, and speed. The system, ZCURVE_CoV, has been run on each of the 11 newly sequenced SARS-CoV
genomes. Consequently, six genomes not annotated previously have been annotated, and some problems with previous annotations of
the remaining five genomes are pointed out and discussed. In addition to the polyprotein chain ORFs 1a and 1b and the four
genes coding for the major structural proteins, spike (S), small envelope (E), membrane (M), and nucleocapsid (N),
ZCURVE_CoV also predicts 5-6 putative proteins between 39 and 274 amino acids in length with unknown functions. Some single
nucleotide mutations within these putative coding sequences have been detected and their biological implications are discussed. A
web service is provided through which a user can obtain the annotation immediately by pasting a SARS-CoV genome sequence
into the input window on the web site (http://tubic.tju.edu.cn/sars/). The software ZCURVE_CoV can also be downloaded freely
from the web address mentioned above and run on computers under Windows or Linux.


© 2003 Elsevier Inc. All rights reserved. 
An outbreak of a life-threatening disease, referred to
as severe acute respiratory syndrome (SARS), has
spread to many countries around the world [1-6]. By
late May 2003, the World Health Organization (WHO)
had recorded more than 7000 cases of SARS and more
than 600 SARS-related deaths, and a global
alert for the illness was issued because of the severity of the
disease (http://www.who.int/csr/sars/en/).

A growing body of evidence has convincingly shown
that SARS is caused by a novel coronavirus, called
SARS-coronavirus or SARS-CoV. Currently, the complete
genomes of 11 strains of SARS-CoV
isolated from SARS patients have been sequenced
[7-9], and more complete genome sequences of SARS-CoV
are expected to follow.

The SARS-CoV genomes are about 30kb in length. 
For such short genome sequences, currently, there is no 
reliable software for the identification of protein-coding 
genes. Therefore, sequenced genomes have either been annotated
manually or left unannotated. Of the 11 complete
sequences, six have not yet been annotated and the
remaining five were annotated manually.

Currently, most algorithms for gene identification in 
prokaryotic genomes, such as GeneMark.hmm [10] and 
Glimmer [11], are based either on the higher-order 
Markov chain model or the hidden Markov chain model 
in which thousands of parameters need to be trained. 
The large number of parameters may result in less 
adaptability, especially for small genomes. In contrast,
ZCURVE [12] is a newly developed system for gene
recognition in bacterial and archaeal genomes in which
only 33 parameters are used and the recognition accuracy
is high. The ZCURVE algorithm thus captures
the essential coding properties of protein-coding genes
with a relatively small number of parameters, which makes it
suitable not only for large genomes but especially for
small ones.

In this paper, we describe a system, called ZCURVE_CoV,
based on a coronavirus-specific ZCURVE
algorithm, which is especially suitable for gene recognition
in SARS-CoV genomes. The system has the advantages
of simplicity, reliability, high accuracy, and
speed. The software system ZCURVE_CoV is freely
available at http://tubic.tju.edu.cn/sars/.


Materials and methods 


Six genome sequences of coronaviruses and the annotation infor- 
mation were downloaded from the web site of NCBI RefSeq project 
(http://www.ncbi.nih.gov/RefSeq/). These coronaviruses include avian 
infectious bronchitis virus (NC_001451), bovine coronavirus 
(NC_003045), human coronavirus 229E (NC_002645), murine hepa- 
titis virus (NC_001846), porcine epidemic diarrhea virus (NC_003436), 
and transmissible gastroenteritis virus (NC_002306). A total of 48 
genes were extracted from the above six genomes and used to train the 
gene-finding algorithm. Currently, 15 genome sequences of SARS
coronavirus (SARS-CoV) strains are available in the GenBank database,
of which 11 are complete and four are partial genomes.
The former include SARS-CoV TOR2 (Accession No.
AY274119), Urbani (AY278741), HKU-39849 (AY278491), CUHK-W1
(AY278554), BJ01 (AY278488), CUHK-Su10 (AY282752),
SIN2500 (AY283794), SIN2748 (AY283797), SIN2679 (AY283796),
SIN2774 (AY283798), and SIN2677 (AY283795), whereas the latter
include SARS-CoV BJ02 (AY278487), BJ03 (AY278490), BJ04
(AY279354), and GZ01 (AY278489).

The gene-finding algorithm presented in this paper is based on the 
Z curve [13], which is a graphic representation of DNA sequences. The 
Z curve method has been used to recognize protein coding genes in the 
budding yeast genome [14]. A new ab initio gene-finding system for 
bacterial and archaeal genomes has been developed recently, based on 
the Z curve method [12]. Here the method with some modifications is 
used to recognize protein coding genes in coronavirus genomes, which 
is presented briefly as follows. Suppose that the occurrence frequencies
of the bases A, C, G, and T (U) at the first, second, and third codon
positions in an ORF are denoted by $a_i$, $c_i$, $g_i$, and $t_i$, respectively,
where $i = 1, 2, 3$. The four numbers $a_i$, $c_i$, $g_i$, and $t_i$ are mapped onto
a point in a 3-dimensional space $V_i$ with the coordinates

\[
\begin{aligned}
x_i &= (a_i + g_i) - (c_i + t_i), \\
y_i &= (a_i + c_i) - (g_i + t_i), \qquad i = 1, 2, 3, \\
z_i &= (a_i + t_i) - (g_i + c_i).
\end{aligned}
\tag{1}
\]


Then, each ORF may be represented by a point or a vector in a
9-dimensional space $V$, where $V = V_1 \oplus V_2 \oplus V_3$ and the symbol $\oplus$
denotes the direct sum of two subspaces. The nine components $u_1$ to $u_9$
of the space $V$ are defined as follows:

\[
\begin{aligned}
u_1 &= x_1, & u_2 &= y_1, & u_3 &= z_1, \\
u_4 &= x_2, & u_5 &= y_2, & u_6 &= z_2, \\
u_7 &= x_3, & u_8 &= y_3, & u_9 &= z_3.
\end{aligned}
\tag{2}
\]
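The mapping in Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not the ZCURVE_CoV source code; the function name `zcurve_features` is hypothetical, and the sketch assumes an in-frame ORF whose length is a multiple of three and which contains only the letters A, C, G, and T (U).

```python
from collections import Counter


def zcurve_features(orf: str) -> list[float]:
    """Return the nine components u1..u9 of Eqs. (1)-(2) for one ORF.

    Assumption: the ORF is in frame, non-empty, and uses only A, C, G, T/U.
    """
    seq = orf.upper().replace("U", "T")
    if not seq or len(seq) % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    features = []
    for i in range(3):                      # codon positions 1, 2, 3
        bases = seq[i::3]                   # every third base, starting at i
        counts = Counter(bases)
        n = len(bases)
        a, c, g, t = (counts[b] / n for b in "ACGT")
        features += [
            (a + g) - (c + t),              # x_i
            (a + c) - (g + t),              # y_i
            (a + t) - (g + c),              # z_i
        ]
    return features
```

For example, `zcurve_features("ATGAAATAA")` returns the nine values for a toy three-codon sequence; a real ORF would be passed in the same way.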


To train the system, two sets of samples are needed: positive samples
corresponding to protein-coding genes (seed ORFs) and negative samples
corresponding to non-coding sequences. In the Z curve method, gene
recognition is essentially based on the compositional asymmetry of the
three codon positions in coding sequences. It
was shown that the overall extent of codon usage bias in RNA viruses
is low and that there is little variation in bias between genes [15].
Coronaviruses belong to the family Coronaviridae, and the G+C content of the
published coronavirus genomes ranges from 37% to 42% [7]. It is
therefore reasonable to infer that the published coronavirus genomes
have similar codon usage. On this basis, gene-finding parameters
derived from some published coronavirus
genomes may be applied to recognize genes in other coronavirus
genomes. Because the SARS-CoV genomes are relatively small (about 30 kb),
it is difficult to obtain enough seed ORFs from the SARS-CoV genomes themselves.
Therefore, we used other published coronavirus genomes to train the
gene-finding parameters: the genomes of avian infectious
bronchitis virus, bovine coronavirus, human coronavirus 229E,
murine hepatitis virus, porcine epidemic diarrhea virus, and
transmissible gastroenteritis virus, from which 48 seed
ORFs were selected. Detailed information about the 48 seed ORFs
is given in Table 1 of the supplementary materials (see
http://tubic.tju.edu.cn/sars/).

Below we describe the strategy used to produce the negative samples. It
is rather difficult to assemble an appropriate set of non-coding
sequences from coronavirus genomes, because the amount of non-coding
sequence in these genomes is too small to be used. A
method to produce negative samples has been developed previously
and has been shown to be an effective way to solve the problem [12].
The same method is used in the current study. In this method, a
negative sample is derived from a seed ORF. Generally speaking,
if the regular structure of a coding sequence is completely destroyed, it
is transformed into a non-coding one. Therefore, a negative sample
may be obtained simply by shuffling the corresponding coding
sequence sufficiently (20,000 times in the current study). The resulting
random sequences from all 48 seed ORFs were used as non-coding
sequences. The major difference between the two is that a coding sequence has some regular
structure, whereas its shuffled counterpart is a random sequence. Strictly speaking, a random
sequence is not a true non-coding sequence, but it is a good approximation.
As shown below, this approximation generally leads to good
gene-finding results.
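The shuffling step can be sketched as follows. The text does not specify what a single "shuffle" is, so this sketch assumes it means one random swap of two positions, repeated 20,000 times; the function name `shuffled_negative` and its parameters are hypothetical.

```python
import random


def shuffled_negative(seed_orf: str, n_swaps: int = 20000, rng=None) -> str:
    """Derive a pseudo non-coding sequence from a seed ORF.

    Assumption: one 'shuffle' is one random swap of two positions.
    The base composition is preserved; the codon structure is destroyed.
    """
    if rng is None:
        rng = random.Random()
    bases = list(seed_orf)
    for _ in range(n_swaps):
        i = rng.randrange(len(bases))
        j = rng.randrange(len(bases))
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)
```

Passing a seeded generator, e.g. `shuffled_negative(orf, rng=random.Random(0))`, makes the negative set reproducible. Because only positions are permuted, the overall base composition of each negative sample equals that of its seed ORF, while the position-specific frequencies of Eq. (1) are averaged out.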

The Fisher linear equation for discriminating the positive and
negative samples in the 9-dimensional space $V$ represents a
hyperplane, described by a vector $\mathbf{c}$ with nine components $c_1, c_2, \ldots, c_9$.
For more details about the Fisher discriminant algorithm, refer to,
for example, [14]. Based on the data in the training set (including the
positive and negative samples), the vector $\mathbf{c}$ and the threshold $c_0$ are
obtained. The coding/non-coding decision for each ORF and negative
sample is made simply by the criterion $\mathbf{c} \cdot \mathbf{u} > c_0$ (coding) or
$\mathbf{c} \cdot \mathbf{u} < c_0$ (non-coding),
where $\mathbf{c} = (c_1, c_2, \ldots, c_9)^T$, $\mathbf{u} = (u_1, u_2, \ldots, u_9)^T$, and "T" indicates the
transpose of a matrix. This criterion
can be rewritten as $Z(\mathbf{u}) > 0$ (coding) or
$Z(\mathbf{u}) < 0$ (non-coding), where $Z(\mathbf{u}) = \mathbf{c} \cdot \mathbf{u} - c_0$. $Z(\mathbf{u})$ is called the Z score or Z index
of an ORF or a fragment of DNA sequence. Finally, the strategy used here to
deal with overlapping ORFs is similar to that described in
the previous paper [12].
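A minimal NumPy sketch of a two-class Fisher linear discriminant and the resulting Z score is given below for illustration. It is not the authors' implementation: the function names are hypothetical, and the threshold $c_0$ is placed at the midpoint of the projected class means, which is an assumption, since the text does not state how $c_0$ is chosen.

```python
import numpy as np


def fit_fisher(pos: np.ndarray, neg: np.ndarray):
    """Fit a Fisher discriminant; rows are samples, columns are u1..u9.

    Returns (c, c0). Assumption: c0 is the midpoint of the projected class
    means; the paper does not specify its threshold rule.
    """
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    dp, dn = pos - mu_p, neg - mu_n
    s_w = dp.T @ dp + dn.T @ dn              # within-class scatter matrix
    c = np.linalg.solve(s_w, mu_p - mu_n)    # direction c = S_w^{-1} (mu_p - mu_n)
    c0 = 0.5 * (c @ mu_p + c @ mu_n)
    return c, c0


def z_score(u: np.ndarray, c: np.ndarray, c0: float) -> float:
    """Z(u) = c . u - c0; positive means 'coding' under this sketch."""
    return float(np.dot(c, u) - c0)
```

With this orientation of $\mathbf{c}$, the mean of the positive samples always receives a positive Z score and the mean of the negative samples a negative one, provided the within-class scatter matrix is invertible.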


Results and discussions 
Comparison with the existing system, GeneMark.hmm


No coronavirus-specific annotation system has
been available so far. Currently, GeneMark.hmm is
commonly used for gene-finding in virus genomes [10].
We submitted the SARS-CoV TOR2 genome to the GeneMark.hmm
website using default settings, and the prediction
result is listed in Table 1. It can be seen that the
Table 1
The genes predicted by GeneMark.hmm for the SARS-CoV TOR2 strain

Gene   Start    Stop     Gene length (bp)
1      <3       53       51
2      265      13,413   13,149
3      13,599   21,485   7887
4      21,492   25,259   3768
5      25,268   26,092   825
6      26,398   27,063   666
7      27,074   27,265   192
8      27,273   27,641   369
9*     27,864   28,118   255
10*    28,130   28,426   297
11*    28,423   29,388   966


* The same genome was submitted several times to the website;
however, the prediction results were not always identical,
indicating that the system is unstable. An important structural protein
gene (N protein), located from 28120 to 29388, was predicted
as ‘gene 10’ and ‘gene 11’ in some results. Sometimes ‘gene 9,’
a highly conserved ORF in all of the 11 SARS-CoV genomes mentioned
above, was not predicted. In addition, the gene coding for a
structural protein (small envelope protein E) was missed by the
prediction. For more details, see the text.


predicted ‘gene 1’ is questionable because of its short
length and the lack of a start codon. An important
structural protein gene (small envelope protein E),
located from 26117 to 26347, was not predicted
by GeneMark.hmm. Moreover, we submitted the same
genome sequence several times to the website; however,
the prediction results were not always identical,
indicating that the system is unstable. An important
structural protein gene (N protein), located
from 28120 to 29388, was predicted as ‘gene 10’ and
‘gene 11’ (marked with * in Table 1) in some
results. Sometimes ‘gene 9’ (marked with * in Table 1),
a highly conserved ORF in all of the 11 SARS-CoV genomes
mentioned above, was not predicted. For gene-finding
in the SARS-CoV genomes, ZCURVE_CoV therefore
performs better than GeneMark.hmm
(see Table 3 in the supplementary materials).


Apply ZCURVE_CoV to analyze the SARS-CoV genomes


Currently, the genome sequences of 15 SARS-CoV 
strains are available in GenBank/EMBL databases, of 
which there are 11 complete and four partially complete 
genomes. The gene-finding software ZCURVE_CoV 
Version 1.0 has been run for each of the 11 complete 
SARS-CoV genomes. To save space, the detailed results 
are listed in Table 3 of the supplementary materials (see 
also the discussion below). In addition to the polyprotein
chain ORFs 1a and 1b, the program predicts the four
genes coding for the major structural
proteins, i.e., spike (S), small envelope (E), membrane
(M), and nucleocapsid (N), in all 11
SARS-CoV genomes. Additionally, ZCURVE_CoV 1.0
predicts 5-6 putative proteins with lengths between
39 and 274 amino acids for the 11 genomes. These putative
genes might code for non-structural proteins in
the SARS-CoV genomes.

To compare the gene-finding result of the system 
ZCURVE_CoV 1.0 with that of known annotation, the 
SARS-CoV TOR2 strain is used as an example. The 
genome of TOR2 strain was annotated manually [8] and 
the annotated result is listed on the left part of Table 2, 
whereas the annotated result of ZCURVE_CoV 1.0 is 
listed on the right part of Table 2. As can be seen, the two
annotations are in good agreement with each other,
except for three ORFs. These three ORFs, i.e., ORF 4,
ORF 13, and ORF 14 annotated by Marra et al. [8], are
not predicted by ZCURVE_CoV 1.0. These ORFs are
completely embedded, with a frameshift, within the 
genes coding for some structural proteins. The absence 
of the transcription regulating sequences (TRSs) at the 
5’ end of these ORFs [8] suggests that they are unlikely 
to be the protein-coding genes. The principal component 
analysis performed below further confirms the above 
conjecture. As mentioned in the Materials and methods 
section, each ORF is represented by a point in a 9-di- 
mensional (9-D) space. Consequently, the positive 
samples (genes) and negative samples (non-coding se- 
quences) are represented by two groups of points in the 
9-D space, respectively. For the TOR2 strain, the 12 
putative genes predicted by ZCURVE_CoV and ORF 4, 
ORF 13, and ORF 14 are represented by the corre- 
sponding points in the 9-D space, respectively. We 
project the points in the 9-D space onto the 3-D space 
spanned by the first, second, and third principal axes 
based on the principal component analysis. The fraction 
of the first three principal components accounts for 
about 70% of the total inertia of the 9-D space. Fig. 1 
shows the distribution of the corresponding points in the 
3-D space, where green and orange balls represent the 
positive samples (genes) and negative samples (non- 
coding sequences), respectively. Blue balls correspond to 
the genes predicted by ZCURVE_CoV for the TOR2 
strain, while red balls correspond to ORF 4, ORF 13, 
and ORF 14 annotated by Marra et al. [8]. It is clear 
that the three red balls are located at the side of non- 
coding sequences, indicating that ORF 4, ORF 13, and 
ORF 14 are very unlikely to code for proteins. 
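The projection described above can be sketched with NumPy as follows. This is an illustrative sketch only: it assumes an ordinary covariance-based principal component analysis on mean-centred 9-D points, computed via singular value decomposition, which may differ in detail from the procedure the authors used. The function name `pca_project` is hypothetical.

```python
import numpy as np


def pca_project(points: np.ndarray, k: int = 3):
    """Project n x 9 points onto the first k principal axes.

    Returns (coords, fraction), where fraction is the share of the total
    inertia (variance) carried by the first k components.
    Assumption: covariance-based PCA on mean-centred data.
    """
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal axes; s holds the singular values.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    coords = centered @ vt[:k].T
    return coords, float(explained[:k].sum())
```

In the setting of the text, `points` would stack the 9-D vectors of the training samples, the predicted genes, and the questionable ORFs, and the returned fraction plays the role of the roughly 70% figure quoted above.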

A similar analysis was performed for the Urbani strain
[7]. The result is listed in Table 3, in which the putative
gene X2 annotated by Rota et al. [7], corresponding to
ORF 4 in Marra et al. [8], is not predicted by ZCURVE_CoV.
Based on the above analysis, X2 is also very
unlikely to code for a protein. Of the 11 complete 
SARS-CoV genomes, six have not yet been annotated. 
We have run the program ZCURVE_CoV for each of 
the 11 genomes. Consequently, those already annotated 
have been re-annotated and those not annotated yet 
Table 2
Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0 for the SARS-CoV TOR2 strain

Genes annotated

Start    Stop      bp       a.a.   Feature
265      13,398    13,134   4377   ORF 1a
13,398   21,485    8088     2695   ORF 1b
21,492   25,259    3768     1255   S protein
25,268   26,092    825      274    ORF 3
25,689   26,153    465      154    ORF 4
26,117   26,347    231      76     E protein
26,398   27,063    666      221    M protein
27,074   27,265    192      63     ORF 7
27,273   27,641    369      122    ORF 8
27,638   27,772    135      44     ORF 9
27,779   27,898    120      39     ORF 10
27,864   28,118    255      84     ORF 11
28,120   29,388    1269     422    N protein
28,130   28,426    297      98     ORF 13
28,583   28,795    213      70     ORF 14

Genes predicted by ZCURVE_CoV 1.0

Start     Stop      bp       a.a.   Feature
265       13,398*   13,134   4377   ORF 1a
13,398*   21,485    8088     2695   ORF 1b
21,492    25,259    3768     1255   S protein
25,268    26,092    825      274    Sars274
26,117    26,347    231      76     E protein
26,398    27,063    666      221    M protein
27,074    27,265    192      63     Sars63
27,273    27,641    369      122    Sars122
27,638    27,772    135      44     Sars44
27,779    27,898    120      39     Sars39
27,864    28,118    255      84     Sars84
28,120    29,388    1269     422    N protein

* The program ZCURVE_CoV 1.0 has two options. The default option uses the heptamer UUUAAAC as the conserved ‘slippery sequence’
to find the coronavirus -1 frameshift site [16]. If the heptamer is found in the upstream sequence near the originally predicted end of ORF 1a,
the end of ORF 1a and the start of ORF 1b are both corrected to the frameshift site (13398 in this genome) according to this
‘slippery sequence.’ If the heptamer cannot be found, only the originally predicted sites for ORF 1a and ORF 1b are displayed in the output
file. The second option ignores the -1 frameshift, and the originally predicted sites for ORF 1a and ORF 1b are always displayed, regardless of
whether the heptamer UUUAAAC is present.


Table 3
Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0 for the SARS-CoV Urbani strain

Genes annotated

Start    Stop      bp       a.a.   Feature
265      13,398    13,134   4377   ORF 1a
13,398   21,485    8088     2695   ORF 1b
21,492   25,259    3768     1255   S protein
25,268   26,092    825      274    X1
25,689   26,153    465      154    X2
26,117   26,347    231      76     E protein
26,398   27,063    666      221    M protein
27,074   27,265    192      63     X3
27,273   27,641    369      122    X4
27,864   28,118    255      84     X5
28,120   29,388    1269     422    N protein

Genes predicted by ZCURVE_CoV 1.0

Start     Stop      bp       a.a.   Feature
265       13,398*   13,134   4377   ORF 1a
13,398*   21,485    8088     2695   ORF 1b
21,492    25,259    3768     1255   S protein
25,268    26,092    825      274    Sars274
26,117    26,347    231      76     E protein
26,398    27,063    666      221    M protein
27,074    27,265    192      63     Sars63
27,273    27,641    369      122    Sars122
27,638    27,772    135      44     Sars44
27,779    27,898    120      39     Sars39
27,864    28,118    255      84     Sars84
28,120    29,388    1269     422    N protein


*See the footnote in Table 2. 


have been annotated. All of the annotated results are 
listed in Table 3 of the supplementary materials. 


Analyze the mutations of the six putative non-structural 
genes by sequence alignment 


To examine nucleotide mutations in the predicted
genes coding for non-structural proteins, we aligned the
coding sequences of Sars274, Sars63, Sars122, Sars44,
Sars39, and Sars84 across the 11 complete
SARS-CoV genomes using ClustalW 1.8 [17]. The
multiple sequence alignments for these six predicted
genes are shown
in Fig. 1 of the supplementary materials. For the three
ORFs Sars122, Sars44, and Sars84, the nucleotide sequences
are fully conserved in the 11 SARS-CoV genomes,
indicating that these ORFs might have crucial biological
functions. Mutations in these gene sequences
would result in loss of important functions. Therefore,
these coding sequences might serve as candidate
targets for designing drugs against SARS. In contrast,
Sars39 is not found in the strains SIN2677 and
SIN2748, and a nucleotide mutation occurs at nucleotide
position 49, leading to the mutation Cys → Arg in the
strains BJ01 and CUHK-W1. The rapid mutations occurring
in Sars39 imply that it is probably not a key
Fig. 1. Distribution of the mapping points corresponding to genes, non-genes, predicted genes, and questionable ORFs for the SARS-CoV TOR2
strain in a 3-dimensional (3-D) space. Each gene or ORF is mapped onto a point in a 9-D space. To visualize the distribution, the mapping points are
projected onto the 3-D space spanned by the first three principal axes based on the principal component analysis. The first, second, and third
principal vectors are denoted by the X-, Y-, and Z-axes, respectively. The first three principal components account for 69.59% of the
total inertia of the 9-D space. Green and orange balls represent the positive samples (genes) and negative samples (non-coding sequences),
respectively. Blue balls correspond to the genes predicted by ZCURVE_CoV for the TOR2 strain, while red balls correspond to ORF 4, ORF 13, and
ORF 14 annotated by Marra et al. [8]. The three red balls are clearly situated on the side of the non-coding sequences, indicating that ORF 4, ORF
13, and ORF 14 are very unlikely to code for proteins.


protein for SARS-CoV. For Sars63, two nucleotide
mutations are observed at base positions 38 and 170,
leading to the amino acid mutations Glu → Gly and
Pro → Leu in the strains SIN2677 and BJ01, respectively.
See Fig. 1 in the supplementary materials for details.

The result of the ClustalW alignment for Sars274 is
shown in Fig. 2. Four nucleotide mutations, located at
positions 31, 302, 406, and 783, have been detected in three different
strains. The first three substitutions
cause amino acid changes (Fig. 2), whereas the last
is a synonymous codon mutation that does not
change the amino acid. At the 31st position,
G → A (TOR2) ⇒ Gly → Arg. Similarly, at the 302nd
position, T (U) → A (HKU-39849) ⇒ Met → Lys; and at


            1                   31              302             406              783               825
TOR2        ATGGATTTGT.....CTCTTAGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
HKU-39849   ATGGATTTGT.....CTCTTGGATCA.....AGGTAAGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
Urbani      ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
SIN2748     ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
SIN2774     ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
SIN2679     ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
SIN2677     ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
SIN2500     ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
CUHK-Su10   ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
CUHK-W1     ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCAAGAACC.....GATCCAATTTA.....GCCTTTGTAA
BJ01        ATGGATTTGT.....CTCTTGGATCA.....AGGTATGGAGG.....AATCCCAGAACC.....GATCCCATTTA.....GCCTTTGTAA

Mutations:
TOR2        Codon (31)    GGA → AGA    Amino acid (11)    Gly → Arg
HKU-39849   Codon (302)   ATG → AAG    Amino acid (101)   Met → Lys
BJ01        Codon (406)   AAG → CAG    Amino acid (135)   Lys → Gln


Fig. 2. Nucleotide mutations of the predicted gene Sars274 based on the alignment of the corresponding coding sequences in the 11 complete genome
sequences. A total of four point mutations are detected, of which one is a silent mutation and the other three cause amino acid changes in the putative
gene. The point mutations occur at nucleotide positions 31, 302, 406, and 783. At the 31st position, G → A (TOR2) ⇒ Gly → Arg.
Similarly, at the 302nd position, T (U) → A (HKU-39849) ⇒ Met → Lys; at the 406th position, A → C (BJ01) ⇒ Lys → Gln; and at the 783rd position,
A → C (BJ01), but with no amino acid change.
the 406th position, A → C (BJ01) ⇒ Lys → Gln. On the
other hand, it was reported by Marra et al. [8] that there
exist three trans-membrane regions spanning approximately
nucleotide positions 102-168 (residues 34-56),
231-297 (77-99), and 309-375 (103-115)
in the Sars274 sequence. The mutations
therefore occur outside the predicted trans-membrane
regions. Note that the second amino acid mutation
(Met → Lys) is substantial, because Met
is a relatively strongly hydrophobic amino acid, whereas
Lys is a strongly hydrophilic one. At present, we do not
know whether these mutations cause severe conformational
changes in the tertiary structure of this putative
protein. The high mutation rate of Sars274 implies that
either it is a relatively unimportant protein for
SARS-CoV, or the mutations do not dramatically change its
biological function. Finally, for the time
being we cannot rule out the possibility that all or
some of these mutations are caused by sequencing errors.
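The codon-level bookkeeping behind statements such as "G → A ⇒ Gly → Arg" can be sketched in Python as follows. This is an illustrative sketch, not the procedure used in the study: it assumes two gap-free, in-frame coding sequences of equal length (for example, two rows taken from a ClustalW alignment without indels) and uses the standard genetic code. The function name `point_mutations` is hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, one-letter amino acids, '*' = stop.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def point_mutations(ref: str, alt: str):
    """List substitutions between two aligned, gap-free coding sequences.

    Each entry is (nt_position, ref_base, alt_base, codon_number,
    ref_amino_acid, alt_amino_acid, is_silent), with 1-based positions.
    """
    ref = ref.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    if len(ref) != len(alt) or len(ref) % 3 != 0:
        raise ValueError("sequences must be in frame and of equal length")
    found = []
    for pos, (r, a) in enumerate(zip(ref, alt)):
        if r == a:
            continue
        start = pos - pos % 3                  # first base of the codon
        aa_ref = CODON_TABLE[ref[start:start + 3]]
        aa_alt = CODON_TABLE[alt[start:start + 3]]
        found.append((pos + 1, r, a, start // 3 + 1,
                      aa_ref, aa_alt, aa_ref == aa_alt))
    return found
```

For the toy pair `point_mutations("ATGGGAATGTAA", "ATGAGAAAGTAA")`, the sketch reports a G → A change in codon 2 (GGA → AGA, Gly → Arg) and a T → A change in codon 3 (ATG → AAG, Met → Lys), the same two codon changes reported for Sars274 at positions 31 and 302.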


Supplementary materials 


The detailed supplementary materials related to this
study are available from the website http://tubic.tju.edu.cn/sars/
and include the following content:

(a) Table 1. The 48 seed ORFs and the six corona- 
virus genomes from which the seed ORFs are derived. 

(b) Table 2. The Fisher coefficients and threshold 
obtained from the seed ORFs. 

(c) Table 3. Results of gene-finding using ZCURVE_CoV
for the 11 complete SARS-CoV genomes.

(d) Fig. 1. The results of multiple sequence alignment 
of the six predicted genes coding for non-structural 
proteins, Sars274, Sars63, Sars122, Sars44, Sars39, and 
Sars84, respectively. 


Online service and availability of the program ZCURVE_CoV


A web interface to the ZCURVE_CoV system has
been constructed. When a user pastes a SARS-CoV
genome sequence into the input window of the website,
the gene-finding result is returned to the user
immediately. A user may also download the executable
version of the program ZCURVE_CoV and run it on
computers under Windows
(95/98/NT/Me/2000 or higher), Linux (Redhat 7.1 or
higher), or SGI IRIX 6.5. For more detailed information,
visit http://tubic.tju.edu.cn/sars/.


Conclusion 
Severe acute respiratory syndrome (SARS) is an extremely
severe disease that has spread to many countries
around the world. Accumulating evidence has shown
that SARS is caused by a new coronavirus, SARS-CoV.
A new system to recognize protein-coding genes in
SARS-CoV genomes, called ZCURVE_CoV, has been
reported in this paper. By applying the program to the 11
complete SARS-CoV genomes, six genomes not annotated
previously have been annotated, and some problems
with previous annotations of the remaining five
genomes have been pointed out and discussed. It is
shown that the three ORFs annotated as protein-coding by
Marra et al. [8], i.e., ORF 4, ORF 13, and ORF 14, are
very unlikely to code for proteins. In addition to
ORF 1a, ORF 1b, and the four genes coding for the
major structural proteins S, E, M, and N, the new system
ZCURVE_CoV also predicts 5-6 putative genes
coding for non-structural proteins. Alignment of each of the
non-structural gene sequences across the 11 complete
genomes revealed some mutations, whose
biological implications have been
discussed.